repository: make check and delete work at N=1 with sha256 pack ids by mr-raj12 · Pull Request #9783 · borgbackup/borg

mr-raj12 · 2026-06-16T10:04:05Z

Description

borg check, delete and compact all assumed that a pack file's name is the chunk_id it holds. That is true only at N=1 while pack ids equal chunk ids. With BORG_TESTONLY_SHA256_PACK_ID=1 (and later at N>1) a pack is named sha256(pack_bytes), so the file name is the pack_id, not the chunk_id.

The object header already carries the chunk_id, meta size and data size (repoobj.py), so the chunk index can be rebuilt from the headers (real chunk_ids) instead of the file names, and removal can find the pack through the index.

Changes:

repoobj.py: add RepoObj.iter_object_headers(), which walks a pack's object headers and yields (chunk_id, obj_offset, obj_size). Works for one object per pack (N=1) and for several (N>1).
repository.py:
- delete() becomes a logged no-op. A pack may hold several objects, so we can not remove one object by dropping its pack without losing the others; real removal goes through store_delete at the pack level.
- list() iterates the chunk index and yields chunk_ids, instead of listing packs/ (which would yield pack_ids).
cache.py: build_chunkindex_from_repo() rebuilds the index from store_list + iter_object_headers (real chunk_id/offset/size) rather than calling Repository.list(), which would recurse into the very index it is building. Adds an init_flags argument so compact can start the chunks as unused and compute usage itself.
compact_cmd.py: --stats builds the index from a real header scan instead of a pack_id == chunk_id fake; the delete loop drops each unused chunk's pack via store_delete.
archive.py / debug_cmd.py: check --repair and debug delete-obj resolve the pack via the chunk index and then store_delete it; debug delete-obj maps a missing pack (stale index entry) back to "not found".
tests: a delete_chunk helper drops a chunk's pack at the store level (used by the check / extract / mount damage tests), since Repository.delete no longer removes anything; test_empty_repository drops packs through the store; test objects get a unique chunk_id so identical payloads do not collapse into one pack under sha256.

Verified with the local test subset, with and without BORG_TESTONLY_SHA256_PACK_ID=1: the repository- and archive-level missing-chunk checks pass under the switch and nothing regresses without it. Archive --verify-data corruption detection under sha256 pack ids (a few check --verify-data tests) is left for a separate change.

refs #8572

Checklist

PR is against master
New code has tests and docs where appropriate
Tests pass (ran the relevant test subset)
Commit messages are clean and reference related issues

codecov · 2026-06-16T10:25:12Z

Codecov Report

❌ Patch coverage is 84.21053% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.84%. Comparing base (2eef6c8) to head (4121509).
⚠️ Report is 14 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/borg/cache.py	69.23%	3 Missing and 1 partial ⚠️
src/borg/repository.py	83.33%	1 Missing and 2 partials ⚠️
src/borg/archive.py	50.00%	1 Missing ⚠️
src/borg/archiver/debug_cmd.py	88.88%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9783      +/-   ##
==========================================
- Coverage   84.88%   84.84%   -0.05%     
==========================================
  Files          92       92              
  Lines       15172    15166       -6     
  Branches     2275     2281       +6     
==========================================
- Hits        12879    12867      -12     
- Misses       1589     1594       +5     
- Partials      704      705       +1

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

A pack is named sha256(pack_bytes), not chunk_id. delete becomes a no-op (real removal is store_delete at the pack level); list iterates the chunk index; build_chunkindex_from_repo reads pack headers for real chunk_ids and gains init_flags so compact computes usage; tests drop a chunk's pack.

ThomasWaldmann

Stuff I found.

ThomasWaldmann · 2026-06-16T21:06:28Z

+    chunk index, since the pack file name is the pack_id, which need not equal the chunk_id.
+    """
+    entry = repository.chunks.get(id)
+    repository.store_delete("packs/" + bin_to_hex(entry.pack_id))


if entry is not None: ... else: ...

ThomasWaldmann · 2026-06-16T21:19:38Z

+        # delete is a no-op now (see Repository.delete), so the object stays retrievable.
        repository.delete(key50)
-        with pytest.raises(Repository.ObjectNotFound):
-            repository.get(key50)
+        assert pdchunk(repository.get(key50)) == b"SOMEDATA"


Maybe rather don't test Repository.delete at all than testing some behaviour that doesn't make sense.

ThomasWaldmann · 2026-06-16T21:31:02Z

+                        # N=1: the defect chunk is alone in its pack; drop the pack. N>1 needs compaction.
+                        pack_id = self.chunks[defect_chunk].pack_id
                        del self.chunks[defect_chunk]
-                        self.repository.delete(defect_chunk)
+                        self.repository.store_delete("packs/" + bin_to_hex(pack_id))


what part of "don't do actions that are very harmful for the general case" didn't you understand?

you are killing a complete pack here because one chunk was problematic.

ThomasWaldmann · 2026-06-16T21:39:55Z

+        for info in repository.store_list("packs"):
+            pack_id = hex_to_bin(info.name)
+            pack = repository.store_load("packs/" + info.name)
+            for chunk_id, obj_offset, obj_size in RepoObj.iter_object_headers(pack):
+                num_chunks += 1
+                chunks[chunk_id] = ChunkIndexEntry(
+                    flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size
+                )


This will read all the data that is stored in all packs.

There needs to be at least a TODO to improve efficiency by only loading the RepoObj headers from the packfile and not also all metadata and data.

This is also bit tricky considering how the borgstore internal caching works right now (IF the Store is configured with a cache): on first access, it will load the complete pack. So even if borg would only read little from the pack, it would fetch the whole pack anyway.

ThomasWaldmann · 2026-06-16T21:45:28Z

+                    flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size
+                )
+    else:
+        # Legacy repo: list() reads its own segment index (no recursion). get() routes through that


That "segment index" used to be called "repository index" in borg1.

Is that legacy part even used? By what?

ThomasWaldmann · 2026-06-16T21:50:09Z

        return data[hdr_size + hdr.meta_size :]  # crypted data

+    @classmethod
+    def iter_object_headers(cls, pack: bytes):


Problematic: API requires giving the complete pack contents.

Could do partial loads to only load the interesting parts of the pack and skip 99% of the data it does not need.

ThomasWaldmann · 2026-06-16T21:54:19Z

                if key <= last_key_checked:  # needs sorted keys
                    continue
                try:
                    obj = self.store.load(key)


keeping the old name here is problematic. obj would make readers think this is a repoobj, while in reality it is a complete pack.

ThomasWaldmann · 2026-06-16T22:14:25Z

        info = dict(id=self.id, version=self.version)
        return info

    def check(self, repair=False, max_duration=0):


guess Repository.check rather needs a rewrite than the little changes you do here.

context:

borg1 check --repository-only for a remote (ssh:) repository ran the Repository code ON THE REPO SERVER. Thus, reading all data was just local file I/O on the server and acceptable.

borg2 does most things client-side by default and the remoting happens inside borgstore - the Repository class code is NOT running on the repo server, but always on the client. Thus, reading all data from all packs is not acceptable because it would transfer everything over the network. Some borgstore code IS running on the repo server though (at least for "rest:" repos), e.g. Store.hash can be done repo-server-side.

For borg2 check --repository-only we could do it like this:

iterate over all pack ids

use store.hash(packname) to compute the hash from the data ON THE REPO SERVER

compare computed hash with pack id

if it matches, the pack does not have corruption.

if it does not match, the pack is corrupted. without --repair just report that. with --repair we need to fetch it, iterate over all repoobjs and parse them into meta and data (this decrypts and find corrupted repoobjs because the AEAD authentication+decryption fails in the authentication step). drop corrupted repoobjs, write good repoobjs into new pack, update the index.

The index could be verified in a similar way, just comparing hashes. If there is a mismatch, emit a warning so that the user will rebuild the index.

Guess borg check without --repair would not rebuild an index by default (too expensive/slow), but just assume that the index is ok IF the hashes are ok.

Does not necessarily need to be in this PR, you decide.

ThomasWaldmann · 2026-06-16T22:28:30Z

+        # <marker> ourselves; index order is stable unless the index is mutated, which is all the
+        # marker pagination needs.
+        self._lock_refresh()
+        collect = marker is None


Yeah, that is a better expression than the original code.

ThomasWaldmann · 2026-06-16T22:41:40Z

Good, failures down to 6 in sha256-packid mode. \o/

borg check --archives-only --verify-data seems to be broken, it exits with 0 although a file chunk is corrupted.

Maybe check if that is a indexing problem (index not updated after put(chunkid, corrupted_data)) or if there is something wrong in the check code still.

Hmm, maybe this is the wrong way to test that. The idea of that test was to simulate a corrupted file chunk (like bit rot), but that put() also writes it to a new packid and also it does a put for a chunkid we already have in the repo (that is an action that borg usually never does).

Guess it would be better and also more realistic to low-level corrupt the chunk data inside the pack without using put() and without touching the index.

ThomasWaldmann · 2026-06-16T23:24:11Z

BTW, if you address review feedback, can you rather do new commits than amending please - that is easier for review.

At the end, git rebase -i can be used to squash stuff together.

ThomasWaldmann requested changes Jun 16, 2026

View reviewed changes

Comment thread src/borg/testsuite/archiver/check_cmd_test.py Outdated

Comment thread src/borg/testsuite/repository_test.py Outdated

Comment thread src/borg/repoobj.py Outdated

Comment thread src/borg/repository.py Outdated

mr-raj12 force-pushed the repository-check-delete-sha256-pack-ids branch from 28c817b to f2ca1ff Compare June 16, 2026 20:34

mr-raj12 force-pushed the repository-check-delete-sha256-pack-ids branch from f2ca1ff to 4121509 Compare June 16, 2026 20:39

ThomasWaldmann requested changes Jun 16, 2026

View reviewed changes

Uh oh!

Conversation

mr-raj12 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ThomasWaldmann left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ThomasWaldmann Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ThomasWaldmann Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ThomasWaldmann commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ThomasWaldmann commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mr-raj12 commented Jun 16, 2026 •

edited

Loading

codecov Bot commented Jun 16, 2026 •

edited

Loading

ThomasWaldmann Jun 16, 2026 •

edited

Loading

ThomasWaldmann Jun 16, 2026 •

edited

Loading

ThomasWaldmann commented Jun 16, 2026 •

edited

Loading